A Grid Portal Implementation for Genetic Mapping of Multiple QTL
Authors: Salman Toor, Mahen Jayawardena, Jonas Lindemann and Sverker Holmgren

Abstract

We present a grid portal implementation of a parallel algorithm for multidimensional QTL mapping problems. The implementation uses the LUNARC application portal framework, which has been modified to manage the multiple job submissions used by the QTL mapping algorithm.

Key words: QTL analysis, grid computing

1 Genetic mapping of QTL

Most traits of medical or economic importance are quantitative. Examples are agricultural crop yield, growth rate in farm animals, and blood pressure and cholesterol levels in humans. These traits are generally believed to be governed by a complex interplay between multiple genetic factors and the environment. One method for locating the genetic regions underlying a quantitative trait is known as Quantitative Trait Locus (QTL) mapping. A QTL is a DNA region (locus, pl. loci) harboring a gene or a regulatory element that affects a quantitative trait. In a standard QTL mapping study, genetic data (genotype data) from an experimental population is used as input to a statistical model of the measured trait (phenotype data). The model fit and significance tests are performed using numerical algorithms implemented in QTL mapping software. A review of QTL mapping methods is given in [6].

Finding the most likely positions of d QTL influencing a trait corresponds to minimizing a d-dimensional non-convex objective function (the outer problem), which is defined by the QTL model fit (the inner problem). A popular approach for computing the model fit is the linear regression method [7,10,14], where a single least-squares problem is solved for each objective function evaluation. Efficient numerical algorithms for solving these least-squares problems in the QTL mapping setting are considered in [12, 13]. In standard QTL mapping software [2,3,11,15], the outer problem is solved using an exhaustive grid search.
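The structure of the outer problem can be illustrated with a minimal Python sketch of an exhaustive grid search. The objective used here, `toy_objective`, is a hypothetical non-convex stand-in for the inner problem; it is not a QTL model fit, and the grid spacing is chosen only for illustration.

```python
import itertools
import math


def exhaustive_grid_search(objective, grid, d):
    """Evaluate `objective` at every point of the d-fold product of `grid`.

    Returns (best_point, best_value, number_of_evaluations).
    """
    best_point, best_value, n_evals = None, math.inf, 0
    for point in itertools.product(grid, repeat=d):
        value = objective(point)
        n_evals += 1
        if value < best_value:
            best_point, best_value = point, value
    return best_point, best_value, n_evals


def toy_objective(x):
    # Hypothetical stand-in for the inner problem (the residual of a model
    # fit). It is non-convex but unrelated to any real genetic data.
    return sum((xi - 3.0) ** 2 + 2.0 * math.cos(4.0 * xi) for xi in x)


grid = [0.5 * k for k in range(13)]  # G = 13 grid points on [0, 6]
for d in (1, 2, 3):
    point, value, n_evals = exhaustive_grid_search(toy_objective, grid, d)
    print(f"d={d}: {n_evals} evaluations, best point {point}")
```

The number of objective evaluations in this sketch is G^d, which is why the approach becomes impractical as d grows.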
The computational requirement for this type of algorithm is O(d^2 G^d), where the number of grid points G is of the order 10. This type of scheme is reliable but prohibitively slow for d > 2, and as a result high-dimensional searches have so far not been used in practice. In this paper, we combine the efficient optimization scheme presented in [12] with the parallelization techniques presented in [9] and [8], and implement the resulting algorithm using the LUNARC grid portal framework.

It should be noted that geneticists already routinely fit models with multiple QTL. This is done using a forward selection procedure, where an identified QTL is included as a known quantity when searching for an additional QTL. In this way it is possible to search for d QTL by a sequence of d one-dimensional exhaustive grid searches. For general QTL models, it is not clear how accurate this technique is. It can be anticipated that the forward selection scheme may fail to detect QTL that only affect the phenotype through interactions with other QTL. Several analyses of real data sets have revealed such interactions between pairs of QTL, some of which were only detectable by solving the full two-dimensional optimization problem [5, 16, 17]. Such results motivate our interest in developing efficient algorithms also for high-dimensional QTL mapping problems.

In the experiments presented in this paper, we use the parallel code to search for potential QTL positions in data from an experimental intercross, consisting of 191 individuals, between European wild boars and white domestic pigs [1]. The pig genome has 18 chromosomes, and its total length is ∼2300 cM. In the computations we use models with d QTL, including both marginal and epistatic effects. We use this data set and these models as representative examples. We do not consider the relevance of the models used, nor do we consider the problem of which statistical model to use.
Also, we do not attempt to establish the statistical significance of the results, and we do not draw any genetic conclusions from the computations. However, the code described in this paper provides a basis for future studies of all these issues, and for performing complete QTL mapping analyses using models that include many QTL.

Note on units: a standard unit of genetic distance is the Morgan [M]. However, distances are often reported in centi-Morgan [cM].
Note on effects: marginal effects are additive, i.e. the combined effect from two loci equals the sum of the individual effects. For epistatic effects, the relationship is nonlinear.

The search for the best QTL model fit should in principle be performed by optimizing over all positions x in a d-dimensional hypercube whose side is given by the size of the genome. The genome is divided into C chromosomes, so the search-space hypercube consists of a set of C^d d-dimensional, unequally sized chromosome combination boxes (cc-boxes). A cc-box can be identified by a vector of chromosome numbers c = [c_1 c_2 ... c_d], and consists of all x for which x_j is a point on chromosome c_j. The ordering of the loci does not affect the model fit, and this symmetry can be used to reduce the search space. We can restrict the search to cc-boxes identified by non-decreasing sequences of chromosomes. In addition, in cc-boxes where two or more edges span the same chromosome, for example c = [1 8 8], we need only consider a part of the box. Since genes on different chromosomes are unlinked, the objective function is normally discontinuous at the cc-box boundaries. This means that the QTL search can be viewed as essentially consisting of n ≈ C^d/d! independent global optimization problems, one for each cc-box included in the search space.
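As a small illustration of the symmetry reduction described above, the following Python sketch enumerates the cc-boxes identified by non-decreasing chromosome sequences and compares their exact number with C^d/d!. The function name `cc_boxes` is introduced here for illustration only, and the sketch does not model the further reduction inside boxes where a chromosome is repeated.

```python
import itertools
import math


def cc_boxes(num_chromosomes, d):
    """Return all cc-boxes kept after using the ordering symmetry.

    Each box is a non-decreasing tuple of 1-based chromosome numbers.
    """
    chromosomes = range(1, num_chromosomes + 1)
    return list(itertools.combinations_with_replacement(chromosomes, d))


C = 18  # number of chromosomes in the pig genome, as stated in the text
for d in (1, 2, 3):
    boxes = cc_boxes(C, d)
    estimate = C ** d / math.factorial(d)
    print(f"d={d}: {len(boxes)} cc-boxes, C^d/d! = {estimate:.0f}")
```

The exact count is the number of multisets of size d drawn from C chromosomes, which is slightly larger than C^d/d! because boxes with repeated chromosomes have fewer than d! distinct orderings.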
This partitioning of the problem is a natural basis for a straightforward parallelization of multi-dimensional QTL searches:

    do (in parallel) i = 1:n
        l_sol(i) = global_optimization(cc-box(i));
    end
    Find the global solution among l_sol(:);

The final (serial) operation consists only of comparing n objective function values, and this work is negligible compared to the work performed within the parallel loop. This type of parallelization was also used in [4] for mapping of single QTL.

In the grid portal implementation we use the modified version of the global optimization scheme DIRECT presented in [12]. To increase the efficiency of the parallelization technique presented above, we use the block-cyclic partitioning of the cc-boxes presented in [8]. In this case, the n independent tasks corresponding to global optimization in single cc-boxes are lumped together into m independent tasks, m < n, where m is a parameter chosen by the user.

2 Grid computing and grid portals

Grid technology promises to change the way we tackle computational problems and use data and other resources across institutional boundaries. In the last decade, a number of research projects have been started with the goal of implementing grid computing. These include Globus (one of the underlying environments for many grid software packages), GridPP, gLite and NorduGrid/ARC. The main concept behind the grid framework is the use of inhomogeneous networks, commodity-class computers and clusters for performing large-scale computational tasks. This is especially valuable for institutions where funding for dedicated HPC hardware cannot be obtained easily.

2.1 Application portals for grid systems

The grid middleware currently available is reliable and is used in many large projects, such as the LHC (Large Hadron Collider) at CERN. Still, more effort is required to make grid systems practical for the general user community.
One of the major factors limiting the use of grid systems is that infrastructure for application handling is missing. This problem becomes worse when the user needs to handle many jobs for a single application. So far, the general approach has been for researchers to write shell scripts for job submission. This provides a solution, but it also diverts attention from the original task. Portals such as gridsphere, CrossGrids, and GridBlocks often provide almost all the functionality of the middleware. Since these portals are not specifically designed for hosting grid applications, users need good technical knowledge of object-oriented concepts and web development in order to extend the portal for their own applications.

For the QTL application presented in this paper, we use the Lunarc Application Portal (LAP). The LAP project is an effort to provide an application-oriented, web-based environment with targeted user interfaces for commonly available applications. The portal currently provides user interfaces for applications such as MATLAB, OCTAVE, ABAQUS (structural analysis) and MOLCAS (computational chemistry) (under development). The portal can also be viewed as a Python-based framework for implementing web interfaces to user applications. Additional user interfaces are added as plugins to an installed portal instance. The implementation goals of the LAP framework have been:

– Lightweight: easy to understand, without large dependencies on other libraries, and easy to deploy and maintain.
– Extendible: it should be easy to extend the portal using a built-in plugin architecture.
– Customizable: the graphical design should be customizable to adapt to existing web designs.
– Available: the next release will be available under an open source license (GPL).

The portal framework is implemented using the Python web application framework Webware. This is a lightweight framework for developing object-oriented web applications.
The framework contains design patterns for application servers, server pages, servlets, session management and many other features. The framework is modular and easily extendable. The portal application server is integrated with the Apache web server using a special Apache module, mod_webkit, provided with Webware. For security reasons, the Apache web server serves the web pages using the HTTPS protocol.

Access to grid resources is implemented using the ARC middleware. Currently the interface is implemented using the client command-line tools in ARC; this approach is gradually being replaced by a direct Python binding to the ARCLib library. The general architecture of the portal is illustrated in figure 1.

Fig. 1. General implementation architecture of the portal.

Work has been initiated to move the job management into a separate service. This service will handle job submission, monitoring and resubmission. The job service will also be able to submit jobs to other, non-ARC-based grids.
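To make the multiple job submission more concrete, the following Python sketch shows one way the n cc-box indices could be lumped into m tasks and turned into one job description per task. It is a sketch under stated assumptions, not the portal's implementation: the block-cyclic dealing is a generic scheme rather than the exact partitioning of [8], the executable name `qtl_search.sh` and its argument convention are hypothetical, and the job descriptions use only basic xRSL attributes (executable, arguments, jobName, stdout, stderr).

```python
def block_cyclic_partition(n_items, m_tasks, block_size=1):
    """Deal item indices 0..n_items-1 to m_tasks in blocks, round-robin."""
    tasks = [[] for _ in range(m_tasks)]
    for start in range(0, n_items, block_size):
        owner = (start // block_size) % m_tasks
        tasks[owner].extend(range(start, min(start + block_size, n_items)))
    return tasks


def xrsl_for_task(task_id, box_indices, executable="qtl_search.sh"):
    """Build an xRSL job description string for one task.

    The executable name and the convention of passing cc-box indices as
    arguments are hypothetical and used only for illustration.
    """
    if not box_indices:
        raise ValueError("a task must contain at least one cc-box")
    args = " ".join(f'"{i}"' for i in box_indices)
    return (
        f'&(executable="{executable}")\n'
        f' (arguments={args})\n'
        f' (jobName="qtl_task_{task_id}")\n'
        f' (stdout="stdout.txt")\n'
        f' (stderr="stderr.txt")'
    )


# Example: 171 cc-boxes (C = 18, d = 2) lumped into m = 8 tasks.
tasks = block_cyclic_partition(n_items=171, m_tasks=8, block_size=4)
for task_id, boxes in enumerate(tasks[:2]):
    print(xrsl_for_task(task_id, boxes))
    print()
```

Dealing the boxes out cyclically, rather than in one contiguous chunk per task, tends to spread large and small cc-boxes more evenly over the m jobs when box sizes vary along the ordering.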
Publication date: 2007